The Influence of Semantics in Text Categorisation: A Comparative Study using the k Nearest Neighbours Method

Authors

  • Edgardo Ferretti
  • Marcelo Luis Errecalde
  • Paolo Rosso
Abstract

In this paper we investigate different uses of semantics in text categorisation tasks. To this end, we consider distinct document representations that differ in the kind of information they incorporate: a) information about terms only, b) semantic information (term senses) and c) a combination of both. Moreover, we study how vocabulary size reduction affects this task. The k Nearest Neighbours method was used to perform the categorisation, and the vocabulary size was reduced by means of the Information Gain technique. A number of different document codifications were tested. The experimental results showed that, in syntactically and semantically richer corpora, including semantic information improves text categorisation, provided that vocabularies with a sufficient number of features are considered.
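
The pipeline named in the abstract (Information Gain to shrink the vocabulary, then k Nearest Neighbours to categorise) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the toy corpus, the binary presence/absence form of Information Gain, the raw term-count vectors, the cosine similarity, and all function names are hypothetical. A token could equally be a plain term, a sense identifier, or both, depending on which of the three representations is being tested.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    # Assumed variant: gain from splitting documents on presence/absence of `term`.
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part)
                      for part in (with_term, without_term) if part)
    return entropy(labels) - conditional

def select_vocabulary(docs, labels, size):
    # Keep the `size` highest-scoring features; ties are broken alphabetically.
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:size]

def vectorise(doc, vocab):
    # Raw term counts over the reduced vocabulary (one possible codification).
    counts = Counter(doc)
    return [counts[t] for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(query, train_vecs, train_labels, k):
    # Majority vote among the k training documents most similar to the query.
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["bank", "loan", "interest"],
    ["bank", "money", "loan"],
    ["river", "bank", "water"],
    ["river", "water", "fish"],
]
labels = ["finance", "finance", "nature", "nature"]

vocab = select_vocabulary(docs, labels, size=3)
train_vecs = [vectorise(d, vocab) for d in docs]
prediction = knn_classify(vectorise(["loan", "money"], vocab), train_vecs, labels, k=3)
print(vocab, prediction)
```

In this toy corpus the ambiguous token "bank" occurs in both classes and so scores low on Information Gain, which is the kind of case where replacing or complementing terms with sense identifiers might change which features survive the reduction.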

Similar resources

Semantic Text Categorization using the K Nearest Neighbours Method

In this paper we investigate the influence of semantics in text categorization. Moreover, we study how vocabulary size reduction affects this task. The K Nearest Neighbours method was used to perform the categorization. In order to reduce the vocabulary size, the Information Gain technique was employed. A number of different document codification alternatives were tested. The experiment...

Pseudo-Likelihood Inference Underestimates Model Uncertainty: Evidence from Bayesian Nearest Neighbours

When using the K-nearest neighbours (KNN) method, one often ignores the uncertainty in the choice of K. To account for such uncertainty, Bayesian KNN (BKNN) has been proposed and studied (Holmes and Adams 2002; Cucala et al. 2009). We present some evidence to show that the pseudo-likelihood approach for BKNN, even after being corrected by Cucala et al. (2009), still significantly underest...

Gender and Authorship Categorisation of Arabic Text from Twitter Using PPM

In this paper we present gender and authorship categorisation using the Prediction by Partial Matching (PPM) compression scheme for text from Twitter written in Arabic. The PPMD variant of the compression scheme with different orders was used to perform the categorisation. We also applied different machine learning algorithms such as Multinomial Naïve Bayes (MNB), K-Nearest Neighbours (KNN), a...

Text Categorization and Information Retrieval Using WordNet Senses

In this paper we study the influence of semantics in the Text Categorization (TC) and Information Retrieval (IR) tasks. The K Nearest Neighbours (K-NN) method was used to perform the text categorization. The experimental results were obtained by taking into account, for each relevant term of a document, its corresponding WordNet synset. For the IR task, three techniques were investigated: the direct us...

An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification

The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...

Publication date: 2005